The simplest null models for networks, used to distinguish significant features of a particular
network from a priori expected features, are random ensembles with the degree sequence fixed by 
the specific network of interest. These "fixed degree sequence" (FDS) ensembles are, however, fa- 
mously resistant to analytic attack. In this paper we introduce ensembles with partially-fixed degree 
sequences (PFDS) and compare analytic results obtained for them with Monte Carlo results for the 
FDS ensemble. These results include link likelihoods, subgraph likelihoods, and degree correlations. 
We find that local structural features in the FDS ensemble can be reasonably well estimated by 
simultaneously fixing only the degrees of a few nodes, in addition to the total number of nodes and
links. As test cases we use a food web, two protein interaction networks (E. coli, S. cerevisiae), the 
internet on the autonomous system (AS) level, and the World Wide Web. Fixing just the degrees 
of two nodes gives the mean neighbor degree as a function of node degree, $\langle k' \rangle_k$, in agreement
with results explicitly obtained from rewiring. For power law degree distributions, we derive the 
disassortativity analytically. In the PFDS ensemble the partition function can be expanded diagram- 
matically. We obtain an explicit expression for the link likelihood to lowest order, which reduces in 
the limit of large, sparse undirected networks with L links and with $k_{\max} \ll L$ to the simple formula
$P(k,k') = kk'/(2L + kk')$. In a similar limit, the probability for three nodes to be linked into a
triangle reduces to the factorized expression $P_\Delta(k_1,k_2,k_3) = P(k_1,k_2)P(k_1,k_3)P(k_2,k_3)$.

PACS numbers: 02.50.Cw, 02.70.Uu, 05.20.Gg, 87.10.+e, 87.23.Cc, 89.75.Fb, 89.75.Hc



I. INTRODUCTION 

A pivotal question of empiricism is the degree to which 
the results of an observation are expected. In ideal cases, 
either predictions based on these expectations remain 
valid in view of new measurements, or the expectations 
have to be changed. But this clear distinction is often 
blurred by uncertainties resulting from measurement er- 
rors, imprecision of model parameters, or the impossi- 
bility of extracting exact predictions from complicated 
models. Whether or not the problem at hand is a typ- 
ical instance of a wider class of problems that are al- 
ready understood is a question of statistical inference. In 
rare cases, the consequences of the expectations (or the 
model) can be derived analytically prior to observation. 
If this is not feasible, a widely used strategy is to construct
a large number of "surrogates" [1], or instances of
a well-defined null model encapsulating the expectations, 
and to compare the actual observations to this artificial 
data. 

Constructing surrogates is equivalent to simulating a 
statistical ensemble. In choosing weights for the ensemble 
of surrogates one often uses Occam's razor — no outcome 
compatible with the null hypothesis should be preferred, 
and all such outcomes are equally likely. This is sim- 
ilar to Jaynes' construction of statistical mechanics by 
maximizing Shannon entropy with physically meaningful 
constraints. Consequently, the numerical construction of 
surrogates often uses Monte Carlo methods similar to
those used in statistical mechanics.

This paper addresses properties of ensembles used as 
null models for complex networks. Predictions based 
on the null models fix expectations, and thereby de- 
termine whether or not deviations in the properties of 
an actual network are functionally or historically signifi- 
cant. While the numerical construction of surrogates of 
these ensembles has received attention in the recent literature
[1], much less is known about analytic methods
(see discussion below). 

Nowadays networks attract enormous interest as repre- 
sentations of complex systems. They take various guises 
in biological, social, technological and physical contexts. 
The nodes designate distinct degrees of freedom (e.g.
agents, species, genes, magnetic concentrations in the 
solar photosphere, or earthquakes) and the links iden- 
tify primary interactions or relationships between pairs of 
nodes (e.g. co-authorship, predator-prey relations, gene 
regulation, magnetic flux tubes, or seismic correlations). 
For examples see, e.g., Ref. [13]. The ubiquity
of networks and their relatively easy visualization as
graphs, together with notions of universality prevalent 
in the physics community, have driven speculations that 
the structure of networks can shed light on fundamental 
principles of social or biological organization, such as po- 
litical behavior, ecosystem dynamics, brain function or 
the regulated homeostasis of organisms. 

At the simplest level, networks are purely static en- 
tities, with each pair of distinct nodes connected by no 
more than one edge (or "link"). If, in addition, the interaction
strength is disregarded (which is often a very useful
simplification), the adjacency matrix $\mathcal{M}$ for the graph is
a square (0,1) matrix. If $M_{ij} = 1$ then an edge points
from node i to node j; if $M_{ij} = 0$ then the edge is absent.
Without self-interactions, $M_{ii} = 0$. For undirected networks,
the adjacency matrix is symmetric, $M_{ij} = M_{ji}$.
The degree $k_i$ of node i is then defined as the number of
edges incident on it, $k_i = \sum_j M_{ji}$. Several reviews of
the field are available.
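
As a small illustration of these definitions, the following sketch computes the degrees $k_i = \sum_j M_{ji}$ from a (0,1) adjacency matrix and checks the conditions for a simple undirected graph. The function names and the four-node example graph are assumptions made for illustration only; they do not come from the paper.

```python
def degrees(adjacency):
    """Return k_i = sum_j M_ji for a square (0,1) adjacency matrix."""
    n = len(adjacency)
    return [sum(adjacency[j][i] for j in range(n)) for i in range(n)]


def is_simple_undirected(adjacency):
    """Check: entries in {0,1}, zero diagonal, symmetric matrix."""
    n = len(adjacency)
    binary = all(x in (0, 1) for row in adjacency for x in row)
    no_self = all(adjacency[i][i] == 0 for i in range(n))
    symmetric = all(
        adjacency[i][j] == adjacency[j][i] for i in range(n) for j in range(n)
    )
    return binary and no_self and symmetric


# Hypothetical example: node 1 is linked to nodes 0, 2 and 3.
M = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
k = degrees(M)
n_links = sum(k) // 2  # each undirected link is counted twice
```

For an undirected graph the degrees sum to 2L, which is the constraint used repeatedly in the ensembles defined below.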

Section II defines more precisely the network ensembles 
(or null models) we consider in this paper. Our analyti- 
cal methods focus on ensembles where the total number 
of links and nodes in the network is specified as well as 
the degrees of a small subset of nodes. These are called 
ensembles of "partially fixed degree sequence" (PFDS). 
Analytic predictions based on the PFDS ensembles can 
be compared with numerical results from a 'rewiring' al- 
gorithm for ensembles with fixed degree sequence (FDS), 
where the number of links attached to every node in the 
network is simultaneously specified. Section III mainly 
recalls previous results. We review Monte Carlo meth- 
ods for sampling the FDS ensemble. Then we discuss 
how such null models can be used, and we conclude by 
recalling previous analytic approaches. Section IV discusses
some results derived later in Section V, namely
analytic estimates of the link likelihood $p_{ij}$ (the likelihood
for a link to connect nodes i and j). It uses them
to make predictions for the average nearest neighbor degree
$\langle k' \rangle_k$ and for disassortativity. The calculation gives
an excellent description of $\langle k' \rangle_k$ for large k, e.g. for an
Escherichia coli protein interaction network and an AS
level map of the Internet. We also compute $\langle k' \rangle_k$ analytically
for the case where the degree distribution is a
power law, using Eq. (1) given below. In that case, the
naive approximation for P(k,k') would give divergent or
ill-defined results.

Section V contains our main analytic results. In order 
to keep the notational clutter of this section to a mini- 
mum and to emphasize the intuitive nature of the results, 
most intermediate steps are moved to Appendix B. In the 
limit of large sparse networks with L links and with the 
maximal degree much less than L, we find that the link 
likelihood depends only on L and on the degrees k and 
$k'$ of the two nodes, and is given by

$$P(k,k') \approx \frac{kk'}{2L + kk'} . \qquad (1)$$



This improves substantially over the widely used 'naive' 
approximation $P(k,k') \approx kk'/2L$. We also find that
the disassortativity of the FDS ensembles correspond- 
ing to several real world networks is well-described by 
PFDS ensembles simultaneously fixing the degrees of 
two nodes at a time. Finally, we find an expression 
for the likelihood of a triangle, which factorizes in the 
same limit of large sparse networks (and when all three 
degrees are much larger than 1) to
$P_\Delta(k_1,k_2,k_3) = P(k_1,k_2)P(k_1,k_3)P(k_2,k_3)$, with $P(k,k')$ given again by
Eq. (1). The paper ends in section VI with a discussion
and an outlook to further problems. 
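
To make the difference between the two approximations concrete, the sketch below evaluates the naive estimate kk'/2L, the formula kk'/(2L + kk') quoted above, and the factorized triangle expression. The function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
def naive_link_likelihood(k, k_prime, n_links):
    """Naive estimate k k' / 2L; it can exceed 1 for two hubs."""
    return k * k_prime / (2 * n_links)


def link_likelihood(k, k_prime, n_links):
    """Large sparse limit, k k' / (2L + k k'); always below 1."""
    return k * k_prime / (2 * n_links + k * k_prime)


def triangle_likelihood(k1, k2, k3, n_links):
    """Factorized triangle likelihood built from the pair likelihoods."""
    return (
        link_likelihood(k1, k2, n_links)
        * link_likelihood(k1, k3, n_links)
        * link_likelihood(k2, k3, n_links)
    )


# Hypothetical numbers: two hubs in a network with L = 100 links.
p_naive = naive_link_likelihood(30, 20, 100)  # 3.0, not a probability
p_eq1 = link_likelihood(30, 20, 100)          # 600 / 800 = 0.75
```

For small kk'/2L the two expressions agree to first order, while for pairs of hubs only the second stays a valid probability.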



II. NULL MODELS FOR NETWORKS 

A. Erdős-Rényi (Undirected) Graphs

The simplest null hypothesis is that a given network is
completely random, not even the number of links being
specified. The only constraint is on the number
of nodes, which is assumed to be N. Each pair of
nodes may be joined with at most one link. Hence, the
number of labelled undirected graphs with fixed N is
$Z_0(N) = 2^{N(N-1)/2}$. This quantity is the number of ways
undirected links may be placed in $\binom{N}{2} = N(N-1)/2$
possible positions. A statistical ensemble is obtained
by assigning weights to each graph. The most natural
choice is to weigh each graph with L links by a factor
$p^L (1-p)^{N(N-1)/2 - L}$, where p is the probability that two
given nodes are connected by a link. This gives the average
number of links as $\bar{L} = pN(N-1)/2$. The average degree of a node,
i.e. the average number of links attached to it, is then
$\bar{k} = 2\bar{L}/N = p(N-1)$, and the degree distribution is
binomial. In the limit of sparse networks, where $p \to 0$
for $N \to \infty$ such that $pN \to$ const., the degree distribution
(or the probability P(k) that a node has k links)
becomes Poissonian,

$$P(k) = \frac{\bar{k}^k \exp(-\bar{k})}{k!} . \qquad (2)$$
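
The sparse limit can be checked numerically. The sketch below compares the exact binomial degree distribution of a node, which has N - 1 possible partners, with the Poisson form for a fixed mean degree. The values of N and of the mean degree are arbitrary assumptions chosen for illustration.

```python
import math


def binomial_degree_pmf(k, n_nodes, p):
    """P(k) when each of the N - 1 possible links is present with probability p."""
    n = n_nodes - 1
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_degree_pmf(k, mean_degree):
    """Poisson limit with the same mean degree."""
    return mean_degree**k * math.exp(-mean_degree) / math.factorial(k)


n_nodes = 10_000            # assumed network size
mean_degree = 4.0           # assumed mean degree p (N - 1)
p = mean_degree / (n_nodes - 1)

# Largest pointwise difference between the two distributions for k <= 20.
max_gap = max(
    abs(binomial_degree_pmf(k, n_nodes, p) - poisson_degree_pmf(k, mean_degree))
    for k in range(21)
)
```

With these values the two distributions differ by far less than any single probability of order 0.1, as expected when p tends to zero at fixed pN.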



While this ensemble can be viewed as a "grand canonical"
version of the Erdős-Rényi ensemble [16] since the
particle fugacity is fixed, it is more customary to associate
Erdős-Rényi graphs with a different ensemble where
the total number L of links is fixed, rather than just the
average $\bar{L}$. Park and Newman refer to the ensemble with
fixed L as "canonical" [19], making an analogy between
the number of links and the number of particles in
traditional statistical mechanics. However, we shall refer 
to this ensemble, and ensembles with similar hard degree 
constraints, as microcanonical. 

Excluding self-connections as well as multiple edges
between any pair of nodes gives

$$Z_1(N,L) = \sum_{\{lg,N\}}^{\{L\}} 1 = \binom{N(N-1)/2}{L} \qquad (3)$$

distinct labelled, undirected graphs with fixed N and L
[17]. The subscript "{lg,N}" indicates a sum over labelled
graphs with N nodes, while the superscript {L}
on the sum indicates, as in later formulae, the constraints
on the edges. The subscript "1" on Z indicates that $Z_1$ is
the number of undirected graphs with one (global) hard
constraint on the links, just as $Z_0(N)$ is the number of
undirected graphs with zero constraints on the links.
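
These two counts are easy to verify by brute force for very small N. The sketch below is only an illustration: it assumes the standard closed forms $2^{N(N-1)/2}$ and $\binom{N(N-1)/2}{L}$ and uses hypothetical function names.

```python
import math
from itertools import combinations


def z0(n_nodes):
    """Number of labelled simple graphs on N nodes (no constraint on links)."""
    return 2 ** (n_nodes * (n_nodes - 1) // 2)


def z1(n_nodes, n_links):
    """Number of labelled simple graphs with N nodes and exactly L links."""
    return math.comb(n_nodes * (n_nodes - 1) // 2, n_links)


def z1_brute_force(n_nodes, n_links):
    """Enumerate every choice of L distinct node pairs (small N only)."""
    pairs = list(combinations(range(n_nodes), 2))
    return sum(1 for _ in combinations(pairs, n_links))


# Summing the fixed-L counts over all L recovers the unconstrained count.
total_over_links = sum(z1(4, n_links) for n_links in range(7))
```

For N = 4 there are 6 possible links, so the sum over L runs from 0 to 6 and equals $2^6 = 64$.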

With no further knowledge or constraints on the net- 
work, Occam's razor suggests assigning equal weight to 
each labelled graph satisfying all the hard constraints. 
This corresponds exactly to the construction of microcanonical
ensembles in statistical mechanics. For $Z_1$,
each node has equal probability to be connected to any
other node. It is easy to show [20] that the distribution
for the number of links attached to each node is again 
Poissonian for sparse networks with large N, where the 
grand canonical and microcanonical ensembles become 
equivalent. 

In contrast, observations of real networks reveal fat- 
tailed degree distributions, which differ starkly from the 
situation where each node has equal likelihood to be con- 
nected to any other node. The most salient consequence 
is that the average degree $\bar{k}$ fails to characterize the
connectivity of the nodes; in particular it cannot account for
the dominant nodes or "hubs" with many links, which
would not typically appear in the Erdős-Rényi ensemble.

B. Ensembles with Fixed Degree Sequences 

As a result, attention has moved to ensembles that 
build additional information into the null hypothesis 
about the "distinguishability" or diversity of the nodes. 
Although many different and equally plausible ways to 
account for diversity can be imagined, to begin we focus 
on the most popular contemporary method. This uses 
the random ensemble of labelled graphs with fixed de- 
gree sequence (FDS) as the relevant null model. The 
complete degree sequence simultaneously fixes all the 
one-node properties for each member of the ensemble, 
without reference to their relationships (or links) in the 
network. Obviously, it is straightforward to obtain the 
degree sequence for any network, and there exist numer- 
ical methods to estimate characteristic properties of the 
corresponding FDS ensemble (see section III). 

The microcanonical FDS ensemble is specified by as- 
signing a specific degree (= number of links) to each node, 
$k_i$ for $i = 1, \ldots, N$, and giving equal weight to each graph
with this degree sequence, while giving zero weight to all 
those graphs which have a different degree sequence. The 
null hypothesis for any observable pertinent to a specific 
graph $\mathcal{G}$ with adjacency matrix $\mathcal{M}_\mathcal{G}$ is then obtained by
taking its expectation value in the FDS ensemble with the 
same degree sequence. For undirected graphs excluding 
self-interactions, the FDS partition sum is the number 
of symmetric (0,1) matrices with zeroes on the diagonal 
and with fixed marginal sums, which can be written ac- 
cording to our previous convention as 

$$Z_N(N,L,k_1,k_2,\ldots,k_{N-1}) = \sum_{\{lg,N\}}^{\{k_1,\ldots,k_N\}} 1 \qquad (4)$$

with $k_1 + k_2 + \ldots + k_N = 2L$. For most networks of
physical interest, $Z_N$ is astronomically large compared to
one, but vanishingly small compared to $Z_1$. For instance,
Chen et al. [1] numerically estimate the number of 12 ×
12 (0,1) matrices with each row and column sum equal
to 2 (and with no restrictions of symmetry or vanishing
diagonal) to be $\approx 2.196 \times 10^{16}$, which agrees well with
the exact number found by Wang and Zhang [21]. This



number is much smaller than the number of all 12 × 12
(0,1) matrices with 24 ones, which is $\binom{144}{24} \approx 1.3 \times 10^{27}$.
Despite efforts by these and other mathematicians
over decades [21], [22], no well-developed, exact analytical
approaches are known for these combinatorial problems, 
but advanced computational methods exist, as described 
in Section III. 

C. Partially Fixed Degree Sequences 

On the one hand, the difficulty of enumerating the 
number of graphs in the FDS ensemble suggests strong 
correlations in the graphs, since similar problems in systems
lacking correlations can often be solved exactly. Indeed,
the FDS ensemble makes very different predictions
from the Erdős-Rényi (ER) ensemble, showing that taking
into account some information about the nodes' degrees
is crucial.

On the other hand, it might be the case that not all 
the constraints in the FDS ensemble must be taken into 
account simultaneously. After all it is the simultaneous 
fixing of all the marginal sums in the matrix that makes 
the calculation of $Z_N$ difficult. Perhaps taking into account
all the constraints, but not requiring them to be
simultaneously enforced, is already sufficient to capture 
some nontrivial aspects of the FDS ensemble. If this is 
possible, then we must also identify which specific small 
subset of the nodes' degrees gives the most reliable esti- 
mate of various expectation values in the FDS ensemble. 

Here, we study ensembles where the degrees of a very 
small subset of the nodes are simultaneously fixed - the 
other degrees being arbitrary up to the constraint on the 
total number of links. We also demand that no more 
than one link may connect any two nodes in the net- 
work, and disallow self-connections. All graphs satisfying 
these constraints have equal weight. Those not satisfying 
these constraints are given zero weight. These ensembles 
can all be viewed as sub-ensembles of the ER ensem- 
ble. For each possible degree subset, we calculate the 
different partition functions corresponding to all possible 
subgraphs of a certain size. From these we can approxi- 
mate expectation values of various quantities in the FDS 
ensemble. 

In the following we shall always label the nodes such
that the first m degrees, $k_1, \ldots, k_m$, are fixed. We call
the resulting ensembles PFDS(m + 1): partially fixed degree
sequence with m + 1 constraints (the final constraint
comes from fixing the number of links, L). Clearly,
putting more constraints on the ensemble of labelled
graphs diminishes its size until, when each and every
edge is specified, the ensemble contains just one member:
the real network $\mathcal{G}$ being studied. For ensembles
with increasing numbers of link constraints this implies
that $Z_{m+1}$ decreases monotonically with m + 1, and

$$1 = Z_\mathcal{G} \ll Z_N(N,L,k_1,k_2,\ldots,k_{N-1}) < Z_{m+1}(N,L,k_1,\ldots,k_m) < Z_1(N,L) , \qquad (5)$$

for $1 < m < N - 1$.
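
The shrinking of the ensemble as more degrees are fixed can be seen by direct enumeration on a toy example. The sketch below is an illustration under assumptions: the five-node target sequence (all degrees equal to 2) and the function name are hypothetical, and brute-force enumeration is only feasible for very small graphs.

```python
from itertools import combinations


def count_pfds(n_nodes, n_links, fixed_degrees):
    """Count labelled simple graphs with N nodes and L links in which
    the first m nodes have the degrees listed in fixed_degrees."""
    pairs = list(combinations(range(n_nodes), 2))
    count = 0
    for edges in combinations(pairs, n_links):
        deg = [0] * n_nodes
        for i, j in edges:
            deg[i] += 1
            deg[j] += 1
        if all(deg[i] == k for i, k in enumerate(fixed_degrees)):
            count += 1
    return count


target = [2, 2, 2, 2, 2]  # assumed toy sequence; N = 5, L = 5
# m = 0 reproduces Z_1(N, L); fixing four degrees already fixes the
# fifth through the sum rule, so m = 4 gives the full FDS count.
counts = [count_pfds(5, 5, target[:m]) for m in range(5)]
```

In this example the first entry is $\binom{10}{5} = 252$, fixing one degree leaves 120 graphs, and the last entry counts the 12 labelled 5-cycles, the only graphs with this degree sequence.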

We find that fixing only the degrees of the nodes par- 
ticipating in the small subgraph (e.g. link or trian- 
gle) under consideration, with explicit exclusion of self- 
connection and multiple-connections between any nodes, 
already gives a good approximation to the disassortativ- 
ity (and to other properties) in the FDS ensemble. As 
noted above, this uses information about the whole de- 
gree sequence, but in each contribution corresponding to 
one specific (labelled) subgraph only part of this infor- 
mation is used. 

The information stored in the degree sequence is most 
important when its distribution is very wide. Even for 
networks exhibiting broad degree distributions, such as 
protein interaction networks or autonomous system maps 
of the Internet, it is sufficient to fix the degrees of the 
node pairs directly involved (as well as the total number 
of links in the network) to obtain a good estimate of $\langle k' \rangle_k$
and of the disassortativity. In order to estimate the num- 
ber of triangles (i.e. the clustering), one has to fix the 
degrees of node triples. If we fix in addition the degrees of 
(some) hubs, this slightly improves the approximations. 
It is much easier to make analytical calculations for small 
m, with the smallest meaningful m being the size of the
subgraph being considered. Hence for link likelihoods
and for $\langle k' \rangle_k$ this minimum is m = 2, while for triangle
likelihoods it is m = 3. By comparing analytic properties
of the PFDS(m) ensemble with numerical estimates
of the FDS ensemble, we can assess to what extent the 
correlations in the PFDS ensembles resemble those in the 
FDS and in the ER ensembles. 

To begin, we will focus in Section IV on the link likelihood,
$p_{ij}$, which is the probability that a randomly chosen
graph from the ensemble contains an edge from node
i to node j. From this microscopic quantity one can calculate
the standard degree-degree correlations that are
commonly compared with real-world networks to identify
statistically significant features. Details of the calculation
of $p_{ij}$ are deferred to Section V (and to Appendix
B). There we will also treat the generalization to $m \geq 3$,
which is needed in order to estimate the frequencies of
higher order quantities such as motifs [23]. As an example
of a motif calculation, we include an estimate of the
number of triangles.



III. BACKGROUND 

A. Monte Carlo algorithms to estimate the FDS 
ensemble 

As for many other problems where one wants to sample
complex instances from some well-defined ensemble, here
two basic strategies predominate: Markov chain Monte
Carlo and sequential sampling [1], [24], [25]. For the present
case, the most obvious and popular Markov chain algorithm
is the rewiring algorithm [26], [27]. We
describe it here for directed graphs; the generalization to
undirected graphs is immediate. We start by making an
initial network with N nodes, no self-connections, and
the desired degree sequence, but without paying attention
to multiple links between pairs of nodes [30]. The
Monte Carlo algorithm proper consists of a sequence of
moves, randomly chosen from a move set, which continues
until equilibrium (i.e. uniformity of sampling) is
reached with sufficient accuracy. A move is initiated by
choosing randomly four different nodes i, j, l and m with
$M_{ij} > 0$ and $M_{lm} > 0$. If either $M_{im} > 0$ or $M_{lj} > 0$, a
null move is performed (the graph is left as it is). If neither
of the pairs {i, m} and {l, j} was already connected
by a link, $M_{ij}$ and $M_{lm}$ are each decreased by 1, while
$M_{im}$ and $M_{lj}$ are increased by one. This corresponds to
swapping one pair of links.

It can be shown easily that this algorithm eventually 
leads to a graph without multiple links (provided such a 
graph consistent with the fixed degree sequence exists). 
After this happens, the algorithm satisfies detailed bal- 
ance (any sequence of moves is equally likely to be cho- 
sen as its reversed sequence) and is ergodic (each graph 
with the same degree sequence can be reached by a suitable
move sequence). As shown in [27], ergodicity is not
strictly satisfied, but the few exceptions can be taken into 
account by including three-link exchanges in the move 
set. 
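
A minimal sketch of the link-swap move for directed graphs follows. It makes two simplifying assumptions that are not part of the description above: the starting graph is already simple (no multiple links), and the move picks two distinct links uniformly at random, treating any overlap among the four end nodes as a null move. The three-link exchanges mentioned above are not included, and all names are hypothetical.

```python
import random


def rewire_directed(edges, n_moves, seed=0):
    """Apply n_moves attempted swaps (i->j, l->m) => (i->m, l->j)."""
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(n_moves):
        a, b = rng.sample(range(len(edge_list)), 2)
        i, j = edge_list[a]
        l, m = edge_list[b]
        if len({i, j, l, m}) < 4:
            continue  # null move: the four nodes must be different
        if (i, m) in edge_set or (l, j) in edge_set:
            continue  # null move: the swap would create a multiple link
        edge_set.difference_update([(i, j), (l, m)])
        edge_set.update([(i, m), (l, j)])
        edge_list[a], edge_list[b] = (i, m), (l, j)
    return edge_list


def in_out_degrees(edges, n_nodes):
    """Return (out-degrees, in-degrees) of a directed edge list."""
    out_deg, in_deg = [0] * n_nodes, [0] * n_nodes
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return out_deg, in_deg


# Hypothetical starting graph: a directed 6-ring plus one chord.
start = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
final = rewire_directed(start, n_moves=1000, seed=1)
```

Each accepted swap keeps the out-degree of the two source nodes and the in-degree of the two target nodes, so the full in- and out-degree sequences are conserved along the chain.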

Sequential sampling proceeds, in contrast, by repeat- 
edly building a new graph from scratch. For this we 
start with an empty adjacency matrix and fill its entries 
randomly. In the simplest version, this is done without 
paying any attention to the degree sequence, to the ab- 
sence of self loops, or to the exclusion of multiple links. 
Instead, the candidate graph is discarded if any of these 
constraints are violated. In this way the uniformity of 
the sampling is guaranteed, but the attrition (i.e. the 
chance to reach an illegal configuration) is overwhelm- 
ing, rendering the algorithm useless. But there are more 
sophisticated options for sequential sampling. The most 
efficient algorithm studied in the literature [1] uses detailed
mathematical results for the structure of legal adjacency
matrices to bias the matchings in a much
more clever way.
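
The attrition problem of the simplest sequential scheme can be illustrated with a rejection sampler. The sketch below is a variant under stated assumptions rather than the scheme described above: it draws L distinct node pairs uniformly (so self-loops and multiple links never occur) and rejects only on the degree sequence. The function name and the toy degree sequence are hypothetical.

```python
import random
from itertools import combinations


def sample_fds_by_rejection(target_degrees, rng, max_tries=100_000):
    """Draw uniform simple graphs with L links until one matches the
    target degree sequence; return (edges, number of attempts)."""
    n_nodes = len(target_degrees)
    n_links = sum(target_degrees) // 2
    pairs = list(combinations(range(n_nodes), 2))
    for attempt in range(1, max_tries + 1):
        edges = rng.sample(pairs, n_links)
        deg = [0] * n_nodes
        for i, j in edges:
            deg[i] += 1
            deg[j] += 1
        if deg == list(target_degrees):
            return edges, attempt
    raise RuntimeError("no graph accepted; attrition is too high")


# Toy sequence: only 12 of the 252 graphs with N = 5, L = 5 are accepted.
edges, attempts = sample_fds_by_rejection([2, 2, 2, 2, 2], random.Random(1))
```

Accepted graphs are uniformly distributed over the FDS ensemble because every candidate is drawn with equal probability, but the acceptance rate falls off rapidly with network size, which is why biased sequential schemes are used in practice.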



B. Uses of null models 

Statistically significant deviations between a null 
model and a real network point to organizing laws or his- 
torical accidents that are not accounted for by the null 
hypothesis. On the other hand, finding no statistically 
significant deviations would promote the belief that the 
entire structure of the network could be accounted for 
by the model, e.g. by the complete degree sequence in 
case of the FDS ensemble. This process of building null 
models can, in principle, be iterated to understand the 
full set of organizing principles or physical constraints on 
a network: one builds a null model, tests for significant 
deviations, and then builds a new null model with richer 
structure to try to reduce any significant deviations to
typicality. 

Through an application of this discriminatory method,
Maslov et al. [33] showed that a significant part of the
disassortativity [34] observed in the Internet could be attributed
to the broad degree distribution together with
the restriction of no multiple links between any pair of
nodes. For a scale free network of N nodes with a degree
distribution $P(k) \sim k^{-\gamma}$, the maximum expected degree
$k_c(N)$ scales as $k_c(N) \sim N^{1/(\gamma-1)}$. In a random network
with no constraints on edge multiplicities, the expected
number of edges between the two largest hubs would then
scale as $k_c(N)^2/N \sim N^{2/(\gamma-1)-1}$. For $\gamma < 3$, this number
diverges with N. If the constraint of no multiple edges
is imposed, these links must be distributed so that they 
connect the hubs to other nodes. This creates fewer links 
between hubs than naively expected, and more links be- 
tween hubs and nodes with small degree; it also leads to 
a suppression of links between nodes with small degrees 
(as the degrees of these nodes are "used up" by connect- 
ing to hubs). The net effect is that fixing a broad degree 
sequence decreases assortativity (the preference for nodes 
with similar degrees to be connected to each other) [33].
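
The scaling argument reduces to simple arithmetic on exponents, sketched below. The function names are hypothetical; the formulas are the ones quoted above, with the maximum degree growing as N to the power 1/(γ - 1) and the hub-to-hub edge count as N to the power 2/(γ - 1) - 1.

```python
def hub_degree_exponent(gamma):
    """Exponent of N in the maximum expected degree k_c(N)."""
    return 1.0 / (gamma - 1.0)


def hub_hub_edge_exponent(gamma):
    """Exponent of N in k_c(N)^2 / N, the expected number of edges
    between the two largest hubs when multiple edges are allowed."""
    return 2.0 / (gamma - 1.0) - 1.0


# The exponent is positive (the count diverges with N) for gamma < 3,
# vanishes at gamma = 3, and is negative for gamma > 3.
exponents = {gamma: hub_hub_edge_exponent(gamma) for gamma in (2.0, 2.5, 3.0, 3.5)}
```

For γ = 2.5, for instance, the expected number of hub-to-hub edges would grow as the cube root of N, which is impossible once multiple links are forbidden.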

On the other hand, Milo et al. [23] have discovered
subgraphs or motifs that are significantly more frequent 
in actual networks than in the corresponding FDS en- 
semble. Identifying these motifs allows a classification 
of networks that share the same motifs. For instance 
feed-forward loops are overrepresented in gene regulation 
networks and in some electronic circuits, while fully con- 
nected triangles are most overrepresented in the world 
wide web. 

So, on the one hand we see that the FDS ensemble, to- 
gether with non-trivial (power law) degree distributions, 
allows both discrimination between features of the net- 
work and comparison with other networks; on the other 
hand, the ensemble itself exhibits strong correlations. To 
explain how these correlations are related to each other 
and to the degree sequence it is useful to have an analytic 
approach. This is also important if one wants to develop 
more refined null models, or study very large networks 
for which rewiring is prohibitive. While most authors 
have considered the FDS ensemble as the most natural 
null model for networks, there have been attempts to 
generalize to more complex ensembles. Maybe the most 
interesting is due to Mahadevan et al. 



C. Previous analytic approaches 

The present paper builds on a paper by Burda et
al. [20]. An alternative strategy to incorporate information
on degree distributions was proposed by Park
and Newman [19]. While we fix the degree sequence
$\{k(m), m = 0, 1, \ldots\}$ exactly, Park and Newman constrain
only the average numbers $\langle k(m) \rangle$, averaged over
the ensemble. Thus, while our approach is microcanonical,
that of [19] is grand canonical. As in statistical
mechanics, calculations are often simpler in the grand
canonical ensemble, but they are feasible and not too difficult
for the PFDS(m) ensemble considered in this paper,
with m small. Note that for finite sized networks, the two
ensembles are not equivalent. Further, for a given network,
physical arguments may suggest that one ensemble
is more explanatory than another.

FIG. 1: Log-log scatter plot of the naive analytic estimate
of link likelihood, $k_{out,i} k_{in,j}/L$ (directed network, St. Martin
food web) or $k_i k_j/2L$ (undirected network, Escherichia
coli protein interaction network [38]), versus the ratio of the
Monte Carlo rewiring estimate to the naive estimate for the
corresponding nodes. Note that the directed network has
considerably more scatter for given $k_{out,i}, k_{in,j}$.



IV. LINK LIKELIHOODS $p_{ij}$ AND
DISASSORTATIVITY IN NULL MODELS



For undirected networks, all pairs of nodes with the 
same degree have the same likelihood to be connected in 
the FDS ensemble. For directed networks the likelihood 
to form a link from node i with out-degree $k_{out,i}$ to node j
with in-degree $k_{in,j}$ also depends on $k_{in,i}$ and $k_{out,j}$. This
is demonstrated in Fig. 1, where the actual link likelihood
estimated using a rewiring algorithm is plotted vs. the
naive analytic estimate of the link likelihood $k_{out,i} k_{in,j}/L$
(for pairs in the directed networks) or $k_i k_j/2L$ (for pairs
in the undirected network). In particular, directed networks
exhibit a high degree of scatter for the same values
of the connected (out- and in-) degrees, showing the importance
of the other degrees associated with the pair
(in- and out-, respectively). Further, the likelihood does
not approach the naive estimate for $k_i k_j \ll L$. This is
due to the constraint of one link between two nodes and 
to the presence of hubs, which thus have to distribute 
their links to different nodes. 

For any ensemble A, let $N_A(k,k')$ be the average number
of links between nodes with degree k and nodes with
degree k'. In terms of the link likelihood,

$$N_A(k,k') = \sum_{i,j} p_{ij,A}\, \delta(k_i - k)\, \delta(k_j - k') , \qquad (6)$$

where the sum over i, j indicates a sum over all pairs of
nodes, and $p_{ij,A}$ is the link likelihood for ensemble A. If
the ensemble A is the trivial ensemble consisting of just
one network, namely the experimentally observed graph
$\mathcal{G}$ with adjacency matrix $\mathcal{M}_\mathcal{G}$, then $p_{ij,A} = (\mathcal{M}_\mathcal{G})_{ij}$.

The average degree $\langle k' \rangle_k$ of neighbors of nodes with
degree k is

$$\langle k' \rangle_k = \frac{\sum_{k'} k' N(k,k')}{\sum_{k'} N(k,k')} , \qquad (7)$$

where we have dropped the subscript "A" for brevity.
This quantity can be related to the (dis)assortativity, i.e.
the tendency of nodes to connect (less) preferentially to
nodes with similar degree. The assortativity was formally
introduced by Newman as the Pearson correlation coefficient
for the degrees of any two nodes connected by an
edge. Intuitively, when the average degree $\langle k' \rangle_k$ is
an increasing function of k then the network shows assortative
mixing, i.e. nodes of low degree tend to connect
to nodes of low degree and nodes of high degree tend
to connect to nodes of high degree. When $\langle k' \rangle_k$ is flat,
the network shows no assortativity, and when $\langle k' \rangle_k$ is a
decreasing function of k then the network shows disassortative
mixing [36].
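
The two definitions above translate directly into a short computation: given a degree sequence and a rule for the link likelihood of a pair, sum the likelihood-weighted neighbor degrees over all pairs and normalize. The sketch below is an illustration under assumptions: the toy degree sequence and all names are hypothetical, and the two likelihood rules are the naive estimate $k k'/2L$ and the large sparse limit $k k'/(2L + k k')$.

```python
from collections import defaultdict


def naive_likelihood(k, k_prime, n_links):
    return k * k_prime / (2.0 * n_links)


def sparse_limit_likelihood(k, k_prime, n_links):
    return k * k_prime / (2.0 * n_links + k * k_prime)


def mean_neighbor_degree(degrees, likelihood, exclude_self=True):
    """Return {k: <k'>_k}, summing the link likelihood over node pairs."""
    n_links = sum(degrees) / 2.0
    weighted = defaultdict(float)  # sum of k' * p_ij, grouped by k_i
    total = defaultdict(float)     # sum of p_ij, grouped by k_i
    for i, k_i in enumerate(degrees):
        for j, k_j in enumerate(degrees):
            if exclude_self and i == j:
                continue
            p_ij = likelihood(k_i, k_j, n_links)
            weighted[k_i] += k_j * p_ij
            total[k_i] += p_ij
    return {k: weighted[k] / total[k] for k in sorted(total)}


toy_degrees = [1, 1, 1, 1, 2, 2, 3, 5]  # assumed sequence, L = 8
flat = mean_neighbor_degree(toy_degrees, naive_likelihood, exclude_self=False)
no_self = mean_neighbor_degree(toy_degrees, naive_likelihood, exclude_self=True)
sparse = mean_neighbor_degree(toy_degrees, sparse_limit_likelihood, exclude_self=False)
```

With the naive rule and self-pairs included, every degree class gets the same value, the ratio of the second to the first moment of the degree sequence. Excluding self-pairs lowers the value for the hub, and the sparse-limit rule gives a curve that decreases with k.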

We can compute $\langle k' \rangle_k$ in several PFDS ensembles.
The ensemble $Z_3(k_1,k_2)$ consists of uncorrelated random
graphs with N nodes, L edges and no multiple or self-connections,
where we fix the degrees of one pair of nodes.
Evidently, we choose the pair whose link likelihood is being
evaluated. Eq. (7) is then calculated by averaging
over all pairs of nodes in the network. This clearly allows
us to take the whole degree sequence into consideration,
although only pairs of node degrees are considered simultaneously.
To include the presence of a hub, we work
in $Z_4(k_1,k_2,k_{\max})$, the ensemble of uncorrelated random
graphs with N nodes, L edges and no multiple or self-connections,
where we fix the degree of the pair of nodes
whose link likelihood is being evaluated, as well as the
degree $k_{\max}$ of the strongest hub.

As shown in Section V and in Appendix B, we can
compute $\langle k' \rangle_k$ exactly in $Z_3(k_1,k_2)$ and $Z_4(k_1,k_2,k_{\max})$,
as well as in the approximate $Z_3(k_1,k_2)$ ensemble with
$p_{ij}$ given by Eq. (1) [see also Eq. (22) below], which becomes
exact in the limit of large N for sparse networks,
and for $k_{\max} \ll L$. In Fig. 2 we plot $\langle k' \rangle_k$ versus k for
an Escherichia coli protein interaction network [38]. The
FDS ensemble, as sampled by the Monte Carlo rewiring
procedure, is clearly disassortative, while the estimate of
$\langle k' \rangle_k$ using the standard naive estimate $p_{ij} = k_i k_j/2L$
shows no disassortativity or assortativity, as expected.
We note that forbidding self-connection but otherwise
using the naive estimate for $p_{ij}$ results in slight disassortativity,
while approximate $Z_3$, exact $Z_3(k_1,k_2)$, and
$Z_4(k_1,k_2,k_{\max})$ are increasingly refined estimates of the
FDS behavior (see Fig. 2). Finally, $Z_4(k_1,k_2,k_{\max})$ gives
only a slight improvement over $Z_3$, indicating that hubs
per se are less important to global properties such as disassortativity
than constraints such as no self- or multiple
connections, which are already implemented at the level
of $Z_3$, along with information about the whole degree
sequence, taken in degree pairs.

FIG. 2: Average degree of the neighbor, $\langle k' \rangle_k$, vs. node degree
k for an Escherichia coli protein interaction network [38]
in several ensembles. The FDS ensemble, sampled by Monte
Carlo rewiring, shows disassortativity as $\langle k' \rangle_k$ is a decreasing
function of k. For the naive estimate $p_{ij} = k_i k_j/2L$ and using
the exact degree sequence, there is no disassortativity (while
using a power law as in Eq. (10) leads to divergence). However,
using the naive estimate but forbidding self-connection
results in slight disassortativity. Approximate $Z_3$, exact $Z_3$,
and $Z_4(k_1,k_2,k_{\max})$ are increasingly refined estimates of the
FDS ensemble. Notice that the latter two can hardly be distinguished.

The approximate $Z_3$ ensemble is of further interest because $\langle k' \rangle_k$ can be calculated analytically. Note that

$$N(k,k') = P(k,k')\,N(k)\,N(k'). \qquad (8)$$

For a power law degree distribution,

$$N(k) = (\gamma - 1)\,N\,k^{-\gamma}, \qquad (9)$$

where $N$ is the number of nodes in the network and $\gamma$ is the exponent of the power law. In this case the sums over $k'$ in Eq. (7) can be approximated by integrals, yielding

$$\langle k' \rangle_k = \frac{\int_1^{\infty} k'\,P(k,k')\,k'^{-\gamma}\,dk'}{\int_1^{\infty} P(k,k')\,k'^{-\gamma}\,dk'}. \qquad (10)$$

If we approximate $P(k,k')$ by Eq. (1), these integrals can be solved in terms of hypergeometric functions:

$$\langle k' \rangle_k = \frac{\gamma-1}{\gamma-2}\;\frac{{}_2F_1\!\left(1,\gamma-2;\gamma-1;-\frac{2L}{k}\right)}{{}_2F_1\!\left(1,\gamma-1;\gamma;-\frac{2L}{k}\right)}. \qquad (11)$$
FIG. 3: $\langle k' \rangle_k$ versus $k$ for the Internet at the AS level from [39]. We plot the analytic estimate of Eq. (11) for $\gamma = 2.3$; for comparison we plot the results in the approximate and exact $Z_3$ ensembles, computed directly from the degree sequence of [39], as well as Monte Carlo rewiring estimates of the FDS ensemble.



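We note that for $\gamma = 2.5$ the hypergeometric functions can be expressed in terms of elementary functions:

$$\langle k' \rangle_k = \frac{\arcsin\!\left(\sqrt{\frac{2L}{2L+k}}\right)}{\left(\frac{k}{2L}\right)^{1/2} - \left(\frac{k}{2L}\right)\arctan\!\left(\left(\frac{k}{2L}\right)^{-1/2}\right)}. \qquad (12)$$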
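As a numerical cross-check of the power-law estimate, the ratio of integrals in Eq. (10) can be evaluated by direct quadrature with $P(k,k') = kk'/(2L + kk')$ and compared with the elementary closed form for $\gamma = 2.5$. This is a minimal sketch under stated assumptions: the integration limits are taken as 1 and $\infty$, the substitution $k' = 1/t$, $t = u^p$ with $p = 2/(\gamma-2)$ is our own choice to obtain a smooth integrand on $(0,1)$, and the function names are illustrative.

```python
import math


def mean_neighbor_degree(k, L, gamma, n=20_000):
    """<k'>_k from Eq. (10) with P(k,k') = k k'/(2L + k k'), by midpoint quadrature."""
    a = 2.0 * L / k
    p = 2.0 / (gamma - 2.0)   # t = u**p makes the integrand smooth at u = 0
    h = 1.0 / n

    def integral(s):
        # int_1^inf x^(-s)/(a + x) dx = int_0^1 t^(s-1)/(1 + a t) dt
        #                             = p * int_0^1 u^(p*s - 1)/(1 + a u^p) du
        total = 0.0
        for j in range(n):
            u = (j + 0.5) * h
            total += u ** (p * s - 1.0) / (1.0 + a * u ** p)
        return p * h * total

    # numerator uses s = gamma - 2, denominator uses s = gamma - 1
    return integral(gamma - 2.0) / integral(gamma - 1.0)


def closed_form_gamma_2_5(k, L):
    """Elementary closed form of the same ratio for gamma = 2.5."""
    x = k / (2.0 * L)
    return math.asin(math.sqrt(1.0 / (1.0 + x))) / (
        math.sqrt(x) - x * math.atan(1.0 / math.sqrt(x))
    )


L = 48436  # number of links quoted for the AS-level Internet data
for k in (1, 10, 100, 1000):
    print(k, mean_neighbor_degree(k, L, 2.5), closed_form_gamma_2_5(k, L))
```

For $k \ll L$ the two columns should agree closely, and the local slope on a log-log plot should be close to $-1/2$, in line with the limit $2L/k \to \infty$.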



To test the validity of Eq. (11), we turn to a large network, specifically Newman's recent AS-level Internet data, for which $L = 48436$. In [3 it is reported that $\gamma \approx 2.2 \pm 0.3$ for the Internet; we estimate $\gamma \approx 2.1 \pm 0.3$ for Ref. [39].

In Fig. 3 we plot the analytic estimates of $\langle k' \rangle_k$ given by Eq. (11) for $\gamma = 2.3$. This value for $\gamma$ gives the best fit for $\langle k' \rangle_k$ and is within the uncertainty of the direct degree distribution measurement of $\gamma$. For comparison we also show the results in the approximate and exact $Z_3$ ensembles, computed directly from the degree sequence of [39], as well as Monte Carlo rewiring estimates of the FDS ensemble. Note the strong similarities between the $Z_3$ results and the Monte Carlo estimates of the FDS ensemble; in particular, we observe a flattening of $\langle k' \rangle_k$ for small $k$ both in the $Z_3$ ensemble and in the Monte Carlo rewiring. This is consistent with the observations of [3.

Also note the similar scaling of the various estimates and of the Monte Carlo results at large $k$. For the Internet, $\langle k' \rangle_k$ has been reported to scale with $k$ as a power law, $\langle k' \rangle_k \sim k^{-\nu}$ with $\nu \approx 0.5$. Our Monte Carlo results for the FDS ensemble, using the degree sequence of Ref. [39], indeed show such a power law for large $k$, but with $\nu \approx 0.75$. The exact and approximate $Z_3$ calculations, obtained with the exact degree sequence, give $\nu \approx 0.7$ and $\nu \approx 0.62$, respectively. When using a power law degree sequence and Eq. (11), the scaling depends on $\gamma$. In this case, however, one can verify that scaling holds not in the large $k$ limit but in the limit $k \ll L$. The curvature of the continuous line visible in Fig. 3 results entirely from the fact that $k$ is not much less than $L$. Thus it is the slope of the continuous line at small $k$ which should be used for extracting $\nu$. With this, one finds that $\nu$ varies from $\approx 0.79$ for $\gamma = 2.2$ to $0.5$ for $\gamma = 2.5$. Note that $\nu = 0.5$ for $\gamma = 2.5$ is an exact result that can be obtained analytically by taking the limit $2L/k \to \infty$ in Eq. (12).

In contrast to the approach of [18], all of our results can be computed directly from the degree distribution or the degree sequence, omitting the intermediate step of constructing a fugacity distribution to match the statistics of the degree distribution and then extracting $\langle k' \rangle_k$ from the fugacities. The fact that the disassortativity properties of the Internet can be studied so directly in the simple approximate $Z_3$ ensemble suggests that Eq. (1) should replace the naive estimate $P(k,k') = kk'/2L$ in other applications, for example in the study of motifs. This is explored in Section V.B (see Eq. (25)).

For undirected networks and any null model $A$, Maslov et al. [33] defined a quantity called the correlation profile, $R(k,k') = N_G(k,k')/N_A(k,k')$, and the Z-score, $Z(k,k') = [N_G(k,k') - N_A(k,k')]/\sigma_A(k,k')$, where $N_A(k,k')$ is defined as in Eq. (8), $N_G(k,k')$ is the analogous quantity for the trivial ensemble (with $p_{ij}$ replaced by $M^G_{ij}$), and $\sigma_A(k,k')$ is the variance of the number of links connecting nodes with degrees $k$ and $k'$ in ensemble $A$ (remember that $N_A(k,k')$ was the average of that number). The specific null model studied in [33] was the FDS ensemble. As shown in Appendix A, a similar analysis can be done comparing different null models to each other. Results are also discussed in Appendix A.



V. ANALYTIC ESTIMATES OF THE FDS 
ENSEMBLE FOR UNDIRECTED NETWORKS 

A. Notation and Basic Identities 

We now derive our principal analytic results. Our central object is the partition function $Z$, which counts the number of graphs in the ensemble. The elementary constraints on the network ($N$ nodes, no multiple or self-connections) imply that the adjacency matrix $M$ is $N \times N$, is symmetric (for undirected networks), has zeroes along the diagonal, and consists solely of 0's and 1's. If we add the constraint of $L$ links, the partition function can be written as

$$Z_1(N,L) = \sum_{\{M_{ij}=0,1\},\; i<j} \delta\Big(L - \sum_{i<j} M_{ij}\Big), \qquad (13)$$

where the sum is over the upper triangle of $M$, due to symmetry. A simple computation gives $Z_1(N,L) = \binom{\binom{N}{2}}{L}$, the number of ways to distribute $L$ links among $\binom{N}{2}$ possible pairs of nodes.
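The counting behind Eq. (13) is easy to confirm by brute force for small graphs. The sketch below enumerates every 0/1 assignment of the upper triangle, keeps those with exactly $L$ ones (the role of the delta function), and compares the count with the binomial coefficient $\binom{\binom{N}{2}}{L}$. The function names are illustrative, and the enumeration is only feasible for very small $N$.

```python
from itertools import combinations, product
from math import comb


def z1_brute_force(N, L):
    """Count symmetric 0/1 adjacency matrices with zero diagonal and exactly L links."""
    pairs = list(combinations(range(N), 2))   # upper triangle, i < j
    return sum(1 for bits in product((0, 1), repeat=len(pairs)) if sum(bits) == L)


def z1_closed_form(N, L):
    """Ways to place L links on binom(N, 2) node pairs."""
    return comb(comb(N, 2), L)


for N in range(2, 6):
    for L in range(comb(N, 2) + 1):
        assert z1_brute_force(N, L) == z1_closed_form(N, L)
print(z1_closed_form(4, 3))
```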
FIG. 4: Schematic decomposition of the adjacency matrix $M$ into the three submatrices $A$, $B$, $C$ discussed in the text. Note that $M$ is symmetric for undirected networks.



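Now let us specify the degrees of $m$ of the nodes. We refer to $m$ as the "order" of a calculation. The partition function becomes

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{M_{ij}=0,1\},\; i<j} \delta\Big(L - \sum_{i<j} M_{ij}\Big) \prod_{l=1}^{m} \delta\Big(k_l - \sum_{i=1}^{l-1} M_{il} - \sum_{j=l+1}^{N} M_{lj}\Big), \qquad (14)$$

where we use the symmetry of the adjacency matrix to write the degree constraints in terms of the variables $M_{ij}$ with $i < j$.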

It will be helpful at this point to introduce some further notation to assist us in organizing this calculation. We split the matrix $M$ into four pieces:

• $A$, the square submatrix controlling the edges linking the $m$ nodes with fixed degree to each other;

• $B$, the rectangular matrix encoding the connections of the $m$ nodes of fixed degree with the rest of the nodes in the graph, and its transpose $B^T$;

• $C$, the square submatrix encoding the edges among the $N - m$ remaining nodes (whose degrees are not specified).

Due to the symmetry of $M$, only $B$ and the upper triangular parts of $A$ and $C$ are independent. In Fig. 4 we present a schematic decomposition of $M$.

The sum over all adjacency matrices decomposes into a sum over the $A$, $B$ and $C$ submatrices, with suitable constraints. In particular, we write the symbol $\sum_{\{A\}}$ for the sum over all possible values $\{0, 1\}$ of the matrix elements of the submatrix $A$. Each term of this sum corresponds to a particular (possibly disconnected) labelled subgraph involving the $m$ nodes of fixed degree. This is analogous to a diagrammatic expansion of the partition function, where the partition function is now written as the sum over all possible subgraphs involving nodes 1 through $m$, and each subgraph is weighted by a degeneracy factor resulting from the summations over $B$ and $C$. This degeneracy counts the number of possible graphs in the ensemble containing that particular subgraph (equivalently, submatrix) $A$. The partition function is written in this notation as:

$$Z_{m+1} = \sum_{\{A\}} Z_{m+1}(A), \qquad (15)$$

where $Z_{m+1}(A)$ is the partition function or degeneracy factor for a given fixed submatrix (equivalently, subgraph) $A$. For each $Z_{m+1}(A)$, the nodes with fixed degrees (i.e. the first $m$ nodes) are connected in a specified way. Thus, for example, the probability of some particular $m \times m$ subgraph $A$ occurring would be

$$\mathrm{Prob}(A) = \frac{Z_{m+1}(A)}{Z_{m+1}}. \qquad (16)$$



As shown in Appendix B, the degeneracy $Z_{m+1}(A)$ of a given subgraph can be written as

$$Z_{m+1}(A) = \binom{\binom{N-m}{2}}{L + \sum_{i<j}^{m} A_{ij} - \sum_{i=1}^{m} k_i} \; \prod_{l=1}^{m} \binom{N-m}{k_l - \sum_{j=1}^{m} A_{lj}}. \qquad (17)$$

The first factor on the right hand side is the degeneracy associated with the upper triangle of the square submatrix $C$. Recall that the submatrix $C$ defines connections between all $N - m$ nodes not in $A$, i.e. all nodes with free degree. This matrix has $\binom{N-m}{2}$ independent places to put a specified number of 1's. The number of 1's in $C$ is $L$, the total number of 1's in the entire (upper triangular) adjacency matrix, minus the number of 1's from edges that have at least one end on a node of fixed degree. By definition, those 1's cannot appear in $C$. The sum $\sum_i k_i$ counts every edge inside $A$ twice, so, due to the symmetry of the entire matrix, the number of 1's to be placed in the upper triangular half of $C$ is $L + \sum_{i<j}^{m} A_{ij} - \sum_{i=1}^{m} k_i$.

Each factor in the product $\prod_{l=1}^{m}$ of Eq. (17) is the degeneracy associated with a row in the matrix $B$. For every such row there are $N - m$ places to put 1's, and the number of 1's that must be placed in the $l$-th row is the degree $k_l$ of the node minus the number of 1's in the corresponding row of the entire matrix $A$. This latter number is the degree of the node within the subgraph $A$.
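The degeneracy counting described above can be checked against explicit enumeration on a small graph. The sketch below is illustrative only: `degeneracy` implements the product of binomial coefficients (one factor for the free links in $C$, one factor per fixed-degree row of $B$), while `brute_force` lists every graph with $N$ nodes and $L$ links and counts those in which the first $m$ nodes have the prescribed degrees and are wired among themselves exactly as in the chosen subgraph. Nodes are labelled from 0, and both function names are our own.

```python
from itertools import combinations
from math import comb


def binom(n, r):
    """Binomial coefficient, defined as zero outside 0 <= r <= n."""
    return comb(n, r) if 0 <= r <= n else 0


def degeneracy(N, L, ks, sub_edges):
    """Number of graphs containing the subgraph `sub_edges` on nodes 0..m-1
    with degrees `ks`, out of all graphs with N nodes and L links."""
    m = len(ks)
    inner = [sum(1 for e in sub_edges if l in e) for l in range(m)]  # degree within A
    free_links = L + len(sub_edges) - sum(ks)                        # 1's left for C
    z = binom(comb(N - m, 2), free_links)
    for k, d in zip(ks, inner):
        z *= binom(N - m, k - d)                                     # row of B
    return z


def brute_force(N, L, ks, sub_edges):
    """Same count by explicit enumeration (small N only)."""
    m = len(ks)
    pairs = list(combinations(range(N), 2))
    target = {frozenset(e) for e in sub_edges}
    count = 0
    for edges in combinations(pairs, L):
        deg = [0] * N
        for i, j in edges:
            deg[i] += 1
            deg[j] += 1
        inside = {frozenset(e) for e in edges if e[1] < m}  # both ends have fixed degree
        if deg[:m] == list(ks) and inside == target:
            count += 1
    return count


for sub in ([], [(0, 1)]):
    print(sub, degeneracy(6, 6, (3, 2), sub), brute_force(6, 6, (3, 2), sub))
```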



B. Calculation of link likelihoods for undirected 
networks 

As a first application we compute the link likelihood for two nodes in this framework. At lowest order we can just specify the degrees of the two nodes under consideration, giving the ensemble $Z_3$, with $m = 2$. For convenience, the two nodes are labelled 1 and 2. The two possible configurations of this subsystem, where the nodes are either connected or disconnected, must be weighted by the appropriate degeneracy factors according to Eq. (17). Let us denote these two configurations by $H$ and $H'$. In $H$ an edge connects the two nodes, so $H_{12} = 1$. We denote this "connected" part of the total partition function $Z_3^{\mathrm{conn}} = Z_3(H)$. In $H'$ there is no edge between the two nodes, so $H'_{12} = 0$. We denote this "disconnected" part of the total partition function $Z_3^{\mathrm{disc}} = Z_3(H')$. The total partition function is $Z_3^{\mathrm{total}} = Z_3^{\mathrm{conn}} + Z_3^{\mathrm{disc}} = Z_3(H) + Z_3(H')$, so that

$$p_{12}^{(m=2)} = \frac{Z_3^{\mathrm{conn}}}{Z_3^{\mathrm{total}}} = \frac{Z_3(H)}{Z_3(H) + Z_3(H')}. \qquad (18)$$

From Eq. (17) the explicit expressions are

$$Z_3^{\mathrm{conn}} = \binom{\binom{N-2}{2}}{L + 1 - k_1 - k_2} \prod_{i=1}^{2} \binom{N-2}{k_i - 1}, \qquad (19)$$

$$Z_3^{\mathrm{disc}} = \binom{\binom{N-2}{2}}{L - k_1 - k_2} \prod_{i=1}^{2} \binom{N-2}{k_i}. \qquad (20)$$



Straightforward calculation gives

$$p_{12}^{(m=2)} = \left[\,1 + \frac{(L + 1 - k_1 - k_2)(N - 1 - k_1)(N - 1 - k_2)}{k_1 k_2 \big((N-2)(N-3)/2 - L + k_1 + k_2\big)}\right]^{-1}, \qquad (21)$$

and in the limit $L \to \infty$, $L/N^2 \to 0$, and $k_i \ll L$ this reduces to

$$p_{12}^{(m=2)} = \frac{k_1 k_2}{2L + k_1 k_2}. \qquad (22)$$



In the limit $k_1 k_2 \ll L$ we recover from Eq. (22) the naive estimate $k_1 k_2/2L$ used by most authors. This naive estimate is a bad approximation if either node 1 or node 2 is a hub. In general, the full expression given in Eq. (21) is a better approximation, although, as we have shown in Section IV, the approximate $Z_3$ ensemble given by Eq. (22) is both analytically tractable and significantly better than the naive estimate.
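The three levels of approximation can be compared numerically. The sketch below is a minimal illustration under stated assumptions: `p12_from_counts` takes the ratio of the connected count to the total count using exact integer binomials, `p12_exact` uses the simplified closed form obtained by cancelling common factors in that ratio, and `p12_sparse` and `p12_naive` are the large sparse limit and the naive estimate. The network sizes in the example are hypothetical, and all function names are ours.

```python
from fractions import Fraction
from math import comb


def p12_from_counts(N, L, k1, k2):
    """Link likelihood as connected count over total count (exact integers).
    Assumes 1 <= k1, k2 <= N - 1 and L >= k1 + k2."""
    pairs = comb(N - 2, 2)
    conn = comb(pairs, L + 1 - k1 - k2) * comb(N - 2, k1 - 1) * comb(N - 2, k2 - 1)
    disc = comb(pairs, L - k1 - k2) * comb(N - 2, k1) * comb(N - 2, k2)
    return Fraction(conn, conn + disc)


def p12_exact(N, L, k1, k2):
    """Closed form 1 / (1 + Z_disc / Z_conn) after cancelling common factors."""
    ratio = Fraction(
        (L + 1 - k1 - k2) * (N - 1 - k1) * (N - 1 - k2),
        k1 * k2 * ((N - 2) * (N - 3) // 2 - L + k1 + k2),
    )
    return 1 / (1 + ratio)


def p12_sparse(L, k1, k2):
    """Large sparse limit k1 k2 / (2L + k1 k2)."""
    return k1 * k2 / (2 * L + k1 * k2)


def p12_naive(L, k1, k2):
    """Naive estimate k1 k2 / 2L (can exceed 1 for hubs)."""
    return k1 * k2 / (2 * L)


# Hypothetical sparse network with two hubs.
N, L, k1, k2 = 10_000, 30_000, 500, 300
print(float(p12_exact(N, L, k1, k2)), p12_sparse(L, k1, k2), p12_naive(L, k1, k2))
```

For this pair of hubs the naive estimate exceeds 1 and so cannot be a probability, while the closed form and its sparse limit remain close to each other.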

The presence of any hubs in the network reduces the link likelihood between two nodes, particularly nodes of low degree, as their links are "stolen" by the hubs. This effect already appears in the calculation of the disassortativity, as shown in Figures 2 and 3. We can refine the preceding computation by including constraints on the degree sequence coming from the hubs. We incorporate these constraints by considering the nodes with the highest degrees first. The partially fixed degree sequence ensemble $Z_4(k_1, k_2, k_{\max})$, then, includes the two nodes of degrees $k_1$, $k_2$ whose link likelihood we compute (free to vary over all node pairs) and the largest hub in the network, with degree $k_{\max}$. In the case that $k_{\max} = k_1$ or $k_2$ we use the next largest hub degree for the third constraint. In the submatrices $A$, all possible ways to connect the three nodes are enumerated. To compute the link likelihood $p_{12}^{(m=3)}$ we divide the partition function into connected $Z_4^{\mathrm{conn}}$ and disconnected $Z_4^{\mathrm{disc}}$ parts, where connected means an edge connects node 1 and node 2, i.e. $A_{12} = 1$. The diagrammatic expansion for the connected sub-partition function is shown in Fig. 5(a), and that for the disconnected sub-partition function in Fig. 5(b).

FIG. 5: Diagrammatic expansion for the connected (a) and disconnected (b) contributions to the partition function $Z_4$ for $p_{12}$.

So we can compute the link likelihood as

$$p_{12}^{(m=3)} = \frac{Z_4^{\mathrm{conn}}}{Z_4^{\mathrm{conn}} + Z_4^{\mathrm{disc}}} = \frac{Z_4(H_1) + Z_4(H_2) + Z_4(H_3) + Z_4(H_4)}{\sum_{i=1}^{8} Z_4(H_i)}, \qquad (23)$$

where $H_1$ to $H_8$ denote the adjacency matrices for the eight graphs shown in Fig. 5. In Fig. 6 we compare the link likelihood $p_{ij}$ for an undirected network in the FDS ensemble (obtained numerically by the rewiring method) to the analytic results in the ensembles $Z_3(k_1, k_2)$ and $Z_4(k_1, k_2, k_{\max})$. For each pair $(i, j)$ we plot the naive estimate $p_{ij} = k_i k_j/2L$ on the horizontal axis; the vertical coordinate is the ratio of the numerical (Monte Carlo) estimate to the naive, no-hub ($Z_3$), and one-hub ($Z_4$) estimates for that pair. The $Z_3$ estimate is already a substantial improvement over the naive analytic estimate, with slight further refinement coming, as expected, from the inclusion of hubs in all the diagrams.
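The sum over the eight three-node subgraphs is short enough to write out directly. The following sketch is illustrative rather than the code used for the figures: it reuses the product-of-binomials degeneracy for a fixed subgraph, sums it over all subsets of the three possible edges among nodes 1, 2 and the hub (labelled 0, 1, 2 in the code), and divides the connected part by the total. All function names are our own.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def binom(n, r):
    """Binomial coefficient, defined as zero outside 0 <= r <= n."""
    return comb(n, r) if 0 <= r <= n else 0


def degeneracy(N, L, ks, sub_edges):
    """Number of graphs (N nodes, L links) in which nodes 0..m-1 have degrees ks
    and are connected among themselves exactly by sub_edges."""
    m = len(ks)
    z = binom(comb(N - m, 2), L + len(sub_edges) - sum(ks))
    for l, k in enumerate(ks):
        inner = sum(1 for e in sub_edges if l in e)
        z *= binom(N - m, k - inner)
    return z


def p12_with_hub(N, L, k1, k2, k_hub):
    """Link likelihood between nodes 0 and 1 when a third degree is also fixed."""
    ks = (k1, k2, k_hub)
    possible = [(0, 1), (0, 2), (1, 2)]
    conn = disc = 0
    for r in range(len(possible) + 1):
        for sub in combinations(possible, r):   # the 8 labelled subgraphs
            z = degeneracy(N, L, ks, sub)
            if (0, 1) in sub:
                conn += z
            else:
                disc += z
    return Fraction(conn, conn + disc)


print(p12_with_hub(6, 7, 3, 2, 4))
```

Because the degeneracies are exact integers, the result can be compared directly with a brute-force enumeration on a small graph; for realistic network sizes the binomials become very large integers, and a practical implementation would more likely work with logarithms.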



FIG. 6: Scatter plot comparing estimates of the link likelihood in an Escherichia coli protein interaction network. On the horizontal logarithmic axis we plot the naive estimate $p_{ij} = k_i k_j/2L$; on the vertical logarithmic axis we plot, for all corresponding node pairs, the ratio of the Monte Carlo rewiring estimate to the naive, $Z_3$ (no hub), and $Z_4$ (one hub) estimates. The latter two nearly coincide.

TABLE I: Comparing estimates of the total number of triangles in various networks with $N$ nodes. MC refers to Monte Carlo rewiring estimates of the FDS ensemble. As expected, the results of Eq. (24) approach the asymptotic result, Eq. (25), for large, sparse networks.

Network          N        Eq. (25)     Eq. (24)     MC           % Error          % Error          % Error
                                                                 (24) vs. (25)    (25) vs. MC      (24) vs. MC
E. coli          230      215.82       289.65       322.14       25.49            33.01            10.09
Yeast (narrow)   1373     302.77       247.65       339.10       -22.26           10.71            26.97
Yeast            1373     651.07       592.59       1160.39      -9.87            43.89            48.93
Yeast (broad)    1373     1553.94      1667.70      2813.37      6.82             44.77            40.72
AS Internet      22963    29157.38     31840.23     37810.68     8.43             22.89            15.79
WWW              325729   379371.15    379706.63    274926.89    0.09             -37.99           -38.11

C. Calculation of subgraph likelihoods for
undirected networks

Estimating the likelihoods of larger subgraphs can be done along exactly the same lines. As an example, we estimate the number of triangles in an undirected network and test this estimate on an Escherichia coli protein interaction network [38], a yeast (Saccharomyces cerevisiae) protein interaction network [40], two artificial yeast protein interaction networks created by modifying the degree sequence of [40] by hand to make it narrower or broader, the Newman AS-level map of the Internet [39], and a symmetrized snapshot of the World Wide Web [41]. Working in an ensemble with four constraints, $Z_4(k_1, k_2, k_3)$, we consider all permitted triples of nodes $(i, j, k)$, forbidding self- and multiple connection. Note that node 3 is no longer fixed as the largest hub, but allowed to range over all nodes. Given fixed nodes 1, 2, and 3, we can compute the likelihood of a triangle between them as

$$p_{\triangle}^{(m=3)} = \frac{Z_4(H_4)}{\sum_{i=1}^{8} Z_4(H_i)}, \qquad (24)$$



where $H_4$ corresponds to the fully connected subgraph, i.e. the last term in the sum of Fig. 5(a). The resulting combinatorial expression is quite unwieldy. However, in the limit $L \to \infty$, $L/N^2 \to 0$, and $1 \ll k_i \ll L$, i.e. the large, sparse network limit used in deriving Eq. (22) with the additional assumption of $1 \ll k_i$, we find a remarkable simplification. The expression factorizes to



$$p_{\triangle}^{(m=3)} = P(k_1, k_2)\,P(k_1, k_3)\,P(k_2, k_3), \qquad (25)$$

where $P(k_1, k_2)$ is given by Eq. (1).

We now test these formulae against the various trial networks. The results are shown in Table I, where "MC" represents the averaged triangle count for many Monte Carlo rewirings, i.e. a numerical estimate of the average number of triangles in the FDS ensemble. The only noticeable trend is the decrease, as $N$ increases, in the absolute value of the % Error (defined as [Eq. (24) - Eq. (25)]/Eq. (24)) between the simple factorized expression of Eq. (25) and the elaborate expression of Eq. (24). This verifies the approximation made in deriving Eq. (25). The time required to make the "complex" estimate given by Eq. (24) is roughly equal to the time required to count the number of triangles for a single Monte Carlo instance; the simple estimate given by Eq. (25) is much faster, running on the WWW data in roughly 90 seconds (on an Intel Core Duo processor) without any optimization for speed.
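A triangle-count estimate based on the factorized expression can be sketched as a sum over node triples. The code below is an illustration under stated assumptions, not the implementation timed above: `expected_triangles` loops over all triples and therefore scales as the cube of the number of nodes, while `expected_triangles_by_class` groups nodes of equal degree and weights each degree triple by the number of node triples it represents, which is one possible way to make large networks tractable. The degree sequences are hypothetical and the function names are ours.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement
from math import comb


def link_prob(k, kp, L):
    """Approximate link likelihood k k' / (2L + k k')."""
    return k * kp / (2 * L + k * kp)


def expected_triangles(degrees):
    """Sum of P(k1,k2) P(k1,k3) P(k2,k3) over all node triples."""
    L = sum(degrees) / 2
    total = 0.0
    for k1, k2, k3 in combinations(degrees, 3):
        total += link_prob(k1, k2, L) * link_prob(k1, k3, L) * link_prob(k2, k3, L)
    return total


def expected_triangles_by_class(degrees):
    """Same sum, grouped by degree value with combinatorial multiplicities."""
    L = sum(degrees) / 2
    counts = Counter(degrees)
    total = 0.0
    for a, b, c in combinations_with_replacement(sorted(counts), 3):
        mult = 1
        for value, times in Counter((a, b, c)).items():
            mult *= comb(counts[value], times)   # ways to pick the nodes
        total += mult * link_prob(a, b, L) * link_prob(a, c, L) * link_prob(b, c, L)
    return total


sample = [5, 4, 4, 3, 3, 3, 2, 2, 1, 1]   # hypothetical degree sequence
print(expected_triangles(sample), expected_triangles_by_class(sample))
```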

It is tedious to verify similar factorizations for larger motifs. However, the convergence to Eq. (25) in the predicted limit provides further evidence for the replacement of the naive estimate $P(k, k') = kk'/2L$, used e.g. in [42], where factorizations as in Eq. (25) were assumed, with Eq. (1). This will be the topic of future work. Such extensions of our method can also be used to study larger motifs in complex networks [H,[l4j], or to study large networks where computational time for rewiring grows prohibitively, but the approximation underlying Eqs.
and (1231) should still be valid. 



VI. CONCLUSION 

Detecting and describing local structure is an important frontier in the study of complex networks, as many of the features distinguishing real-world networks from their random analogues or null models are local: degree-degree correlations, motifs, and so forth. One of the major obstacles to this project is the lack of analytical techniques to study the fixed degree sequence ensemble, which is the most common null model for complex networks associated with the rewiring method. In this paper we have reviewed the numerical tools for studying the FDS ensemble and discussed some of the practical uses (e.g. disassortativity, motif calculation, correlation profiles) to which knowledge of local structure can be put. Through a careful study of the partition function of the FDS ensemble and the PFDS ensembles containing it, we have derived simple and general combinatorial expressions that improve naive estimates of the link likelihood by explicitly including important constraints from the FDS ensemble (the exclusion of multiple edges and self-connections, and the appearance of a broad range of degrees) in a "Gaussian" type approximation where the set of degree constraints is treated minimally but non-trivially.

In particular, for undirected networks we have developed the analytically tractable approximate $Z_3$ ensemble, where the link likelihood $P(k, k') = kk'/(2L + kk')$ (Eq. (1)) gives clear disassortativity, while the naive estimate $kk'/2L$ does not. We have also introduced a diagrammatic expansion of the PFDS partition function, which organizes the combinatorial calculations usefully and leads to a simple, approximate factorized formula for the estimated number of undirected triangles between three nodes, $p_{\triangle}^{(m=3)}(k_1, k_2, k_3) = P(k_1, k_2)P(k_1, k_3)P(k_2, k_3)$ (Eq. (25)), where $k_1, k_2, k_3$ are the degrees of the 3 nodes. The factorization suggests the application of Eq. (22) to extended local structures such as motifs.

It should be emphasized that these analytic results are not merely useful for the null model they have been explicitly developed to approximate (the FDS ensemble). They also provide guidance in developing more complicated null models that incorporate higher-level constraints. The astronomical Z-scores observed in work on extended motifs [43], [44] dramatize the need for such extensions, which might constrain, for example, the number of triangles in addition to the number of nodes, number of links, and a few degrees. Further work will explore applications of the approximate $Z_3$ and other PFDS ensembles to motif estimates, as well as the incorporation of higher-level constraints in the PFDS ensemble to improve likelihood estimates of extended motifs. The extension of the results of this paper to directed networks is in preparation.
VII. APPENDIX A: COMPARING NULL
MODELS VIA THE CORRELATION PROFILE



We can study how the real network deviates from various null hypotheses by calculating $R(k, k')$ with respect to each of them. This provides an overall measure of how close the ensembles are to each other and helps establish the relevant features that distinguish the real network from the different ensembles.

In general, we can define correlation profiles and Z-scores for any pair $(A, B)$ of ensembles:

$$R_{A|B}(k,k') = \frac{N_A(k,k')}{N_B(k,k')}, \qquad (26)$$

and

$$Z_{A|B}(k,k') = \frac{N_A(k,k') - N_B(k,k')}{\sigma_B(k,k')}. \qquad (27)$$



In particular, we can calculate the correlation profile for the numerically sampled FDS ensemble and the ensemble $Z_3(k_1, k_2)$. Recall that the ensemble $Z_3$ consists of uncorrelated random graphs with $N$ nodes, $L$ edges and no multiple or self-connections, where we fix the degree of the pair of nodes whose link likelihood is being evaluated. We might also compare the FDS ensemble to $Z_4(k_1, k_2, k_{\max})$, the ensemble of uncorrelated random graphs with $N$ nodes, $L$ edges and no multiple or self-connections, where we fix the degree of the pair of nodes whose link likelihood is being evaluated, as well as the degree of the largest node in the network, $k_{\max}$. In $Z_3$ and $Z_4$, $p_{ij}$ can be calculated exactly (see Section V). We plot $R_{Z_3|\mathrm{FDS}}(k, k')$ for the Escherichia coli protein interaction network in Fig. 7. It is clear that the $Z_3$ ensemble captures many of the features of the FDS ensemble, as $R_{Z_3|\mathrm{FDS}}(k, k')$ is close to 1 for all $k, k'$. $R_{Z_4|\mathrm{FDS}}(k, k')$ exhibits slight improvement over $Z_3$, as expected (data not shown). The correlation profile allows us to identify correlations due to the degrees of the other nodes in the network and provides a test of our hypothesis that the PFDS ensemble captures much of the structure of the FDS ensemble.
FIG. 7: Correlation profile $R_{Z_3|\mathrm{FDS}}(k, k')$ for the Escherichia coli protein interaction network [33]. Note that the dark blue regions are artificially set to the value 0.7; they correspond to values of $(k, k')$ for which no data exist.



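VIII. APPENDIX B: DERIVATION OF $Z_{m+1}$ FOR UNDIRECTED NETWORKS

We note that the number of possible undirected subgraphs of $m$ nodes is $2^{(m^2 - m)/2}$. So we write:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{A\}} \sum_{\{B\}} \sum_{\{C\}} \delta\Big(L - \sum_{i<j}^{m} A_{ij} - \sum_{l=1}^{m}\sum_{j=m+1}^{N} B_{lj} - \sum_{m<i<j\le N} C_{ij}\Big) \prod_{l=1}^{m} \delta\Big(k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj} - \sum_{j=m+1}^{N} B_{lj}\Big). \qquad (28)$$

For concision we henceforth write the sum more compactly, as in the next equation. We now Fourier transform the delta-functions:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{ABC\}} \frac{1}{(2\pi)^{m+1}} \int_{-\pi}^{\pi}\!\cdots\!\int_{-\pi}^{\pi} dz\, d\alpha_1 \ldots d\alpha_m \; e^{iz\left(L - \sum_{i<j}^{m} A_{ij} - \sum_{l=1}^{m}\sum_{j=m+1}^{N} B_{lj} - \sum_{m<i<j\le N} C_{ij}\right)} \prod_{l=1}^{m} e^{i\alpha_l\left(k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj} - \sum_{j=m+1}^{N} B_{lj}\right)}. \qquad (29)$$

We can do the sum over $B$, yielding:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{AC\}} \frac{1}{(2\pi)^{m+1}} \int_{-\pi}^{\pi}\!\cdots\!\int_{-\pi}^{\pi} dz\, d\alpha_1 \ldots d\alpha_m \; e^{iz\left(L - \sum_{i<j}^{m} A_{ij}\right)} \prod_{l=1}^{m} e^{i\alpha_l\left(k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj}\right)} \prod_{l=1}^{m} \left(1 + e^{-i(z+\alpha_l)}\right)^{N-m} \prod_{m<i<j\le N} e^{-izC_{ij}}. \qquad (30)$$

Performing the standard binomial expansion yields:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{AC\}} \frac{1}{(2\pi)^{m+1}} \int_{-\pi}^{\pi}\!\cdots\!\int_{-\pi}^{\pi} dz\, d\alpha_1 \ldots d\alpha_m \; e^{iz\left(L - \sum_{i<j}^{m} A_{ij}\right)} \prod_{l=1}^{m} \left[ e^{i\alpha_l\left(k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj}\right)} \sum_{n_l=0}^{N-m} \binom{N-m}{n_l} e^{-i n_l (z+\alpha_l)} \right] \prod_{m<i<j\le N} e^{-izC_{ij}}. \qquad (31)$$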
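Integrating over $\alpha_1 \ldots \alpha_m$ sets $n_l = k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj}$, so we have:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{AC\}} \frac{1}{2\pi} \int_{-\pi}^{\pi} dz \; e^{iz\left(L - \sum_{i<j}^{m} A_{ij}\right)} \prod_{l=1}^{m} \binom{N-m}{n_l} e^{-iz n_l} \prod_{m<i<j\le N} e^{-izC_{ij}}. \qquad (32)$$

We now sum over $C_{ij}$ and perform the binomial expansion of the resulting quantity:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{A\}} \frac{1}{2\pi} \int_{-\pi}^{\pi} dz \; e^{iz\left(L + \sum_{i<j}^{m} A_{ij} - \sum_{l=1}^{m} k_l\right)} \sum_{s=0}^{\binom{N-m}{2}} \binom{\binom{N-m}{2}}{s} e^{-izs} \prod_{l=1}^{m} \binom{N-m}{k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj}}, \qquad (33)$$

where we have used the fact that

$$\sum_{l=1}^{m} \Big( \sum_{i=1}^{l-1} A_{il} + \sum_{j=l+1}^{m} A_{lj} \Big) = 2 \sum_{i<j}^{m} A_{ij}$$

by the symmetry of the adjacency matrix. We may now integrate over $z$ to give the result:

$$Z_{m+1}(N,L,k_1,\ldots,k_m) = \sum_{\{A\}} \binom{\binom{N-m}{2}}{L + \sum_{i<j}^{m} A_{ij} - \sum_{l=1}^{m} k_l} \prod_{l=1}^{m} \binom{N-m}{k_l - \sum_{i=1}^{l-1} A_{il} - \sum_{j=l+1}^{m} A_{lj}}. \qquad (34)$$

This can be written as a sum over sub-partition sums $Z_{m+1}(A)$, each of which is given by Eq. (17). Thus we recover the result for fixed $A$ given in Section V.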
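The step that recurs throughout this derivation, Fourier-representing a Kronecker delta and then picking out one term of a binomial expansion, can be checked numerically. The sketch below is a minimal illustration: it evaluates $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{izn}\,(1 + e^{-iz})^{M}\,dz$ by uniform sampling of $z$ and compares it with $\binom{M}{n}$. Uniform sampling is our own choice; it is exact up to rounding for such trigonometric polynomials once the number of sample points exceeds $M$. The function name is illustrative.

```python
import cmath
from math import comb, pi


def fourier_coefficient(M, n, samples=None):
    """(1/2pi) * integral over one period of e^{izn} (1 + e^{-iz})^M dz.

    Evaluated with K equally spaced points on [0, 2pi); by periodicity this
    equals the integral over [-pi, pi).
    """
    K = samples if samples is not None else M + 1
    total = 0j
    for j in range(K):
        z = 2 * pi * j / K
        total += cmath.exp(1j * z * n) * (1 + cmath.exp(-1j * z)) ** M
    return (total / K).real


M = 12
for n in range(M + 1):
    print(n, round(fourier_coefficient(M, n), 6), comb(M, n))
```

The integral over $z$ selects the single term of the expansion whose exponent cancels $n$, which is how the binomial factors in the final result arise.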